Name: [jonathan-karl]

Exercise 1 (5 points)

Using the file posts.csv in this repo (the sample of 10,000 public Facebook posts by members of the US Congress from 2017 which we also used in week 8), solve the following with dplyr:

# Load necessary package
suppressMessages(library(dplyr))

# Load data into dataframe
posts_raw <- read.csv("data/posts.csv", stringsAsFactors = FALSE)

# Exclude posts with zero likes, compute the comment to like ratio for each
# post (i.e. comments_count / likes_count), compute the average comment to
# like ratio for each politician (screen_name), subtract the corresponding
# politician's average from the ratio of each post, and sort the result in
# descending order of the de-meaned comment to like ratios.
posts_adjstd <- posts_raw %>%
  filter(likes_count != 0) %>%
  mutate(comment_like_ratio = comments_count / likes_count) %>%
  group_by(screen_name) %>%
  mutate(average_comment_like_ratio = mean(comment_like_ratio)) %>%
  ungroup() %>%
  mutate(demeaned_comment_like_ratio = comment_like_ratio - average_comment_like_ratio) %>%
  arrange(desc(demeaned_comment_like_ratio))

# Report the screen name and value of the 10 posts with the highest value of the de-meaned comment to like ratio.
posts_adjstd %>%
  #select(screen_name, demeaned_comment_like_ratio) %>%
  head(n = 10L)
## # A tibble: 10 x 18
##    screen_name date  post_type message likes_count comments_count shares_count
##    <chr>       <chr> <chr>     <chr>         <int>          <int>        <int>
##  1 susancolli… 2017… photo     "Senat…          97           1719           28
##  2 Congresswo… 2017… status    "State…         204           3408          142
##  3 senatortoo… 2017… link      "You c…          20            298           29
##  4 RepClaudia… 2017… link      "We'll…          22            243           25
##  5 SenDuckwor… 2017… video     "HAPPE…        1551          13838         5718
##  6 RepJoeBart… 2017… link      "And i…          96            964           18
##  7 RepMullin   2017… event      <NA>             1              8            0
##  8 RepMikeBis… 2017… link      "Speak…          12            105            0
##  9 AndyHarris… 2017… link      "My se…           3             26            0
## 10 Congressma… 2017… photo     "I had…          10             72            3
## # … with 11 more variables: love_count <int>, haha_count <int>,
## #   wow_count <int>, angry_count <int>, sad_count <int>, gender <chr>,
## #   type <chr>, party <chr>, comment_like_ratio <dbl>,
## #   average_comment_like_ratio <dbl>, demeaned_comment_like_ratio <dbl>

Exercise 2 (7 points)

When scraping websites, it is not always necessary to find out the CSS selector or XPath of elements that are of interest. With your knowledge of regular expressions, you can sometimes scrape the full HTML content of a website and extract all relevant parts from it with a regular expression in one go. This exercise is about such a case.

  1. With the rvest package, scrape the HTML content of the LSE’s graduate course site https://www.lse.ac.uk/resources/calendar/courseGuides/graduate.htm and transform it into one large text with html_text(). (Do not use the html_nodes() function or any selectors)

  2. With stringr from the tidyverse, define and use a regular expression that extracts only the course codes from this text. In the end you should store the course codes in a character vector (not the names of the courses, just their codes have to be extracted with the regular expression).

  3. Print the first 200 course codes; this should be a vector starting with “AC411, AC412, AC415, …”

  4. How many unique graduate course codes are there in total? There are 1072 unique graduate course codes in total.

# Load necessary libraries
suppressMessages(library(stringr))
suppressMessages(library(rvest))

# a) Scrape the HTML content of the LSE's graduate course site and transform it into one large text
url <- "https://www.lse.ac.uk/resources/calendar/courseGuides/graduate.htm"
lse_html <- read_html(url)
html_courses <- html_text(lse_html)

# b) Define and use a regular expression that extracts only the course codes from this text. Store the course codes in a character vector.
all_courses <- unlist(str_extract_all(html_courses, "[A-Z]{2}[0-9]+[A-Z0-9]{1,3}"))

# c) Print the first 200 course codes; this should be a vector starting with "AC411, AC412, AC415, ..."
print(all_courses[1:200])
##   [1] "AC411"  "AC412"  "AC415"  "AC416"  "AC417"  "AC424"  "AC425"  "AC444" 
##   [9] "AC470"  "AC480"  "AC490"  "AC491"  "AC493"  "AC499"  "AN402"  "AN404" 
##  [17] "AN405"  "AN419"  "AN420"  "AN424"  "AN436"  "AN442"  "AN443"  "AN444" 
##  [25] "AN447"  "AN451"  "AN456"  "AN457"  "AN458"  "AN461"  "AN463"  "AN467" 
##  [33] "AN469"  "AN471"  "AN472"  "AN473"  "AN475"  "AN476"  "AN477"  "AN478" 
##  [41] "AN479"  "AN480"  "AN481"  "AN497"  "AN498"  "AN499"  "DV400"  "DV407" 
##  [49] "DV410"  "DV411"  "DV413"  "DV415"  "DV418"  "DV420"  "DV421"  "DV423" 
##  [57] "DV424"  "DV428"  "DV431"  "DV432"  "DV433"  "DV434"  "DV435"  "DV442" 
##  [65] "DV444"  "DV445"  "DV447"  "DV453"  "DV454"  "DV455"  "DV456"  "DV457" 
##  [73] "DV458"  "DV460"  "DV461"  "DV462"  "DV463"  "DV464"  "DV472"  "DV480" 
##  [81] "DV483"  "DV490"  "DV491"  "DV492"  "EC400"  "EC402"  "EC411"  "EC413" 
##  [89] "EC417"  "EC421"  "EC423"  "EC424"  "EC426"  "EC427"  "EC428"  "EC441" 
##  [97] "EC442"  "EC443"  "EC451"  "EC452E" "EC453"  "EC465"  "EC475"  "EC476" 
## [105] "EC484"  "EC485"  "EC486"  "EC487"  "EC4B5"  "EC4B6"  "EH401"  "EH402" 
## [113] "EH404"  "EH409"  "EH413"  "EH421"  "EH423"  "EH426L" "EH426M" "EH427" 
## [121] "EH428"  "EH429"  "EH430"  "EH446"  "EH452"  "EH454"  "EH457"  "EH463" 
## [129] "EH472"  "EH473"  "EH474"  "EH476"  "EH479"  "EH480"  "EH481"  "EH482" 
## [137] "EH483"  "EH486"  "EH496"  "EH497"  "EH498"  "EH499"  "EU409"  "EU410" 
## [145] "EU421"  "EU425"  "EU430"  "EU432"  "EU437"  "EU439"  "EU440"  "EU443" 
## [153] "EU446"  "EU447"  "EU449"  "EU450"  "EU453"  "EU455"  "EU457"  "EU458" 
## [161] "EU464"  "EU467"  "EU468"  "EU469"  "EU475"  "EU476"  "EU477"  "EU478" 
## [169] "EU481"  "EU482"  "EU484"  "EU485"  "EU486"  "EU487"  "EU488"  "EU489" 
## [177] "EU490"  "EU491"  "EU492"  "EU495"  "EU499"  "EU4A1"  "EU4A2"  "EU4A4" 
## [185] "EU4A5"  "EU4C9"  "EU4V9"  "FM402"  "FM403"  "FM404"  "FM405"  "FM405E"
## [193] "FM406"  "FM406E" "FM407"  "FM407E" "FM408"  "FM408E" "FM409"  "FM409E"
# d) How many unique graduate course codes are there in total?
length(unique(all_courses))
## [1] 1072

Exercise 3 (15 points)

Using the Twitter streaming API, collect tweets for some hours or a couple of days with a filter of one or more words you choose related to lockdown. Store these tweets in a file such that you do not have to collect them again later when knitting the document.

Then download the following sentiment dictionary from the University of Pittsburgh (MPQA), and process the data such that all positive words are stored in one character vector and all negative words in another. Make sure to delete all duplicated words within each vector.

Afterwards use quanteda for the textual analysis. You can build a corpus and document-feature matrices (dfms) based on the tweet texts you collected, and you can create a quanteda (sentiment) dictionary with positive and negative sentiment words based on your two vectors. Then find all tweets which contain at least one of the sentiment keywords. For simplicity, label every tweet as positive that contains at least one of the positive keywords and none of the negative ones, and every tweet as negative that contains at least one of the negative keywords and none of the positive ones. For tweets with both types of keywords, label tweets as positive if they contain more positive keywords and as negative if they contain more negative keywords. All tweets without keywords are labeled as neutral, and tweets containing both kinds of keywords in equal numbers are treated as neutral as well.

  1. How many unique positive and negative sentiment keywords do your two sentiment word vectors contain after you have loaded and cleaned the MPQA data? Also print the first 10 words from each vector. There are 2304 unique positive words and 4153 unique negative words.

  2. How many lockdown related tweets did you collect and which keyword(s) did you use to filter? Keywords used to filter: StayHomeStaySafe, StayHome, Covid19, covid19, Covid-19, covid-19, Coronavirus, Lockdown, lockdown, covid, Covid

  3. Through your textual analysis, what share of all lockdown related tweets do you find as having negative sentiment, what share as positive sentiment?

  4. Visualise the content of all positive lockdown tweets in a word cloud and the content of all negative lockdown tweets in another word cloud. To make this more informative, do not depict the following words: i) stopwords, ii) the filter words you used to define lockdown tweets, and iii) the sentiment words (because these words will be contained in the tweets by construction and might otherwise bias the word clouds).

Hint: The sentiment list comes in a somewhat nonstandard text-based file subjclueslen1-HLTEMNLP05.tff - have a look at it in your text editor. Such issues are very frequent when working with data, so it is good to study how to solve them as well. One approach is to load the file specifying that the column delimiter is a space, i.e. read.csv("subjclueslen1-HLTEMNLP05.tff", sep = " ", header = FALSE). The only columns that you need afterwards contain the words (word1) and the sentiment (priorpolarity). You can keep only these two columns from the data frame and replace "word1=" and "priorpolarity=" in each line with an empty string "". Afterwards you can create your two character vectors with unique positive and negative sentiment words. A different approach would be to open the file subjclueslen1-HLTEMNLP05.tff in a text editor and find-replace "type=", "len=", "word1=", "pos1=", "stemmed1=", "priorpolarity=", etc. all with "" before saving the file again and loading it into R.

Note: If your two word clouds are not too different, the likely reason is that we are classifying tweets into positive and negative with a very simple heuristic here (i.e. by checking whether they contain one or more keywords from the dictionary).

# Using the Twitter streaming API, collect tweets for some hours or a couple of days with a filter of one or more words you choose related to `lockdown`. This analysis only uses tweets in the English language.

# Load libraries
suppressMessages(library(rtweet))
suppressMessages(library(quanteda))

#####################
#Before running this chunk, make sure a list of 5 elements ("app", "consumer_key", "consumer_secret", "access_token" and "access_secret") called "authentication" exists in your global environment. This list should contain all necessary information to set up a twitter token.
#####################

# Create Twitter token
load("authentication.rda")
twitter_token <- create_token(app = authentication$app,
                              consumer_key = authentication$consumer_key,
                              consumer_secret = authentication$consumer_secret,
                              access_token = authentication$access_token,
                              access_secret = authentication$access_token_secret)

# Stream tweets which mention the keywords and write them into a JSON file.
keywords_twitter <- c("StayHomeStaySafe, StayHome, Covid19, covid19, Covid-19, covid-19, Coronavirus, Lockdown, lockdown, covid, Covid")
#lockdown_tweets <- stream_tweets(q = keywords_twitter, timeout = 600, file_name = "backup", language = "en")
lockdown_tweets <- suppressMessages(parse_stream("lockdown_tweets.json"))

#Read in sentiment lexicon
sentiment_lex <- read.delim("subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff", header = FALSE, sep = " ", col.names = c("type", "len", "word", "pos", "stemmed", "polarity")) %>%
  select(word, polarity)
sentiment_lex$word <- str_remove_all(sentiment_lex$word, "[a-z0-9]+=")
sentiment_lex$polarity <- str_remove_all(sentiment_lex$polarity, "[a-z0-9]+=")

# Create vector for unique positive words
positive <- sentiment_lex %>%
  filter(polarity == "positive")
positive_words <- unique(positive$word)

# Create vector for unique negative words
negative <- sentiment_lex %>%
  filter(polarity == "negative")
negative_words <- unique(negative$word)

# a) How many unique positive and negative sentiment keywords do your two sentiment word vectors contain after you have loaded and cleaned the MPQA data? Also print the first 10 words from each vector.
length(positive_words)
## [1] 2304
print(positive_words[1:10])
##  [1] "abidance"      "abide"         "abilities"     "ability"      
##  [5] "able"          "above"         "above-average" "abound"       
##  [9] "absolve"       "abundant"
length(negative_words)
## [1] 4153
print(negative_words[1:10])
##  [1] "abandoned"   "abandonment" "abandon"     "abase"       "abasement"  
##  [6] "abash"       "abate"       "abdicate"    "aberration"  "abhor"
# Set up dictionary based on positive and negative sentiment
sentiment_dictionary <- dictionary(list(positive = positive_words, negative = negative_words))

# Remove hashtags from the text, create a corpus, set up a document-feature matrix, convert it to a data frame to ease analysis, and label tweets as positive, neutral or negative
lockdown_tweets$text <- str_remove_all(lockdown_tweets$text, "#")
tweet_corpus <- corpus(lockdown_tweets, docid_field = "status_id")
tweet_dataframe_raw <- tweet_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm(dictionary = sentiment_dictionary) %>%
  convert(to = "data.frame") 
  
tweet_dataframe <- tweet_dataframe_raw %>%
  mutate(sentiment = case_when(positive > negative ~ "positive",
                               positive < negative ~ "negative",
                               TRUE ~ "neutral"))

# b & c) Compute the number of tweets collected, and the shares of positive, neutral and negative tweets in the dataset.
tweet_dataframe %>% nrow()
## [1] 32688
tweet_dataframe %>%
  count(sentiment) %>%
  mutate(share_of_total = n/sum(n)*100)
##   sentiment     n share_of_total
## 1  negative  9694       29.65614
## 2   neutral 10180       31.14293
## 3  positive 12814       39.20093
# Reshape keyword vector to suit quanteda
keywords_quanteda <- unlist(strsplit(keywords_twitter, ", "))

# Remove the following words from the corpus: i) stopwords, ii) the filter words used to define lockdown tweets, and iii) the sentiment words (because these words are contained in the tweets by construction and might otherwise bias the word clouds).
tweet_dfm <- tweet_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE,  remove_url = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("en"), case_insensitive = TRUE) %>%
  tokens_remove(pattern = keywords_quanteda, case_insensitive = TRUE) %>%
  tokens_remove(pattern = positive_words) %>%
  tokens_remove(pattern = negative_words) %>%
  dfm()

# Assign the sentiment labels as a docvar of the dfm in order to filter word cloud content by sentiment
tweet_dfm$sentiment <- tweet_dataframe$sentiment

# d) Visualise the content of all positive and negative lockdown tweets in two word clouds
textplot_wordcloud(dfm_subset(tweet_dfm, sentiment == "positive"), rotation=0, min_size=1.5, max_size=4, max_words=50)

textplot_wordcloud(dfm_subset(tweet_dfm, sentiment == "negative"), rotation=0, min_size=1.5, max_size=4, max_words=50, color = "red")

Exercise 4 (10 points)

The Archive API of the NYT allows you to download all articles from a given month (up to one paragraph in length). While you could use e.g. the rtweet package in Exercise 3, the goal now is to create your own function for the NYT API into which a user can input a year, e.g. 1918, and obtain a single concatenated data frame/tibble containing all data from this year.

  1. The data frame which the function returns should additionally contain two new columns: One named headline_processed and one named text_combined. While the original headline column contains different pieces of information, the headline_processed column should be of type character and only contain the main text of the headline. The text_combined column should also be of type character and contain the concatenation of a row’s headline_processed and snippet separated by a space " ". Lastly, before returning the data frame, the function should delete the columns headline, byline, multimedia, and keywords. These are not required in the analysis here and contain nested lists and data frames which can make writing the files to disk as .csv later unnecessarily complicated.

  2. Use your function to download all article data for 1918 and 1919 in two data frames which will be used in later exercises.

  3. Print the first 10 rows of each of your two data frames for 1918 and 1919, selecting only the three columns _id, headline_processed, and text_combined. Also print out the total number of rows and columns of each of the two data frames.

  4. Store the two data frames as .csv files on your computer such that you can load them later without downloading the data again. You can e.g. use the function write_csv() from the readr package.

Hint: Briefly read up about the rate limits of the NYT API in the FAQ and pause your script for some seconds before downloading the next month in the function.

# Load necessary libraries
suppressMessages(library("httr"))
suppressMessages(library("jsonlite"))
suppressMessages(library("tidyverse"))
suppressMessages(library("readr"))

#####################
#Before running this chunk, make sure you have a key for the NYT API that is stored in a variable called apikey.
#####################
apikey <- "RsAkGtRAGuvpnKb7XqNRxbWp0uKKVnRU"

#Create function
return_yearly_archive_data <- function(year) {
  #Takes a year between 1851 and 2020 as input,
  #returns all NYT articles from the NYT Archive API as a data frame.
  df <- data.frame()
  for(i in seq(1,12)){
    archive_url <-sprintf("https://api.nytimes.com/svc/archive/v1/%g/%g.json?api-key=%s", year, i, apikey)
    raw <- GET(archive_url)
    json_as_text <- content(raw, "text")
    json <- fromJSON(json_as_text)
    df_new <- json$response$docs %>% as_tibble()
    df_new$headline <- df_new$headline$main
    df_new <- rename(df_new, "headline_processed" = "headline")
    df_new$text_combined <- paste(df_new$headline_processed, df_new$snippet, sep = " ")
    df_new <- subset(df_new, select = -c(byline, multimedia, keywords))
    df <- rbind(df, df_new)
    Sys.sleep(6)
  }
  assign("df", df, envir=.GlobalEnv)
  print(df[1:10, c("_id", "headline_processed", "text_combined")])
}

# b) Use function to download all article data for 1918 and 1919 in two data frames
# c) Print the first 10 rows of each data frame (1918 & 1919), selecting only the three columns `_id`, `headline_processed`, and `text_combined`. Further, print out the total number of rows and columns of each data frame.
# d) Store the two data frames as `.csv` files
for(i in seq(1918,1919)){
  return_yearly_archive_data(i)
  print(paste("The dataframe with NYT articles from the year ", i," contains ", nrow(df), " rows and ", ncol(df), " columns.", sep = ""))
  write_csv(df, paste("NYT_archive_",i, "_df.csv", sep = ""))
  assign(paste("NYT_archive_",i, "_df", sep = ""),df)
}
## # A tibble: 73,161 x 17
##    abstract web_url snippet lead_paragraph print_page source headline_proces…
##    <chr>    <chr>   <chr>   <chr>          <chr>      <chr>  <chr>           
##  1 ""       https:… ""      ""             21         The N… Will They Go 90…
##  2 "offers… https:… "offer… ""             15         The N… MEET ROSENWALD'…
##  3 "Intern… https:… "Inter… ""             21         The N… READY TO GATHER…
##  4 "Govt. … https:… "Govt.… ""             20         The N… ARGENTINA TO SE…
##  5 "Fifth … https:… "Fifth… ""             14         The N… MENORAH THANKS …
##  6 "Statis… https:… "Stati… ""             3          The N… BRITISH CASUALT…
##  7 "editor… https:… "edito… ""             16         The N… THE WOMAN VOTER…
##  8 "Letter… https:… "Lette… ""             15         The N… WILSON BACKS LA…
##  9 "Equita… https:… "Equit… ""             5          The N… REFUSE $1,5000,…
## 10 "messag… https:… "messa… ""             3          The N… SEVEN ALLIES SE…
## # … with 73,151 more rows, and 10 more variables: pub_date <chr>,
## #   document_type <chr>, news_desk <chr>, section_name <chr>,
## #   type_of_material <chr>, `_id` <chr>, word_count <int>, uri <chr>,
## #   print_section <chr>, text_combined <chr>
## [1] "The dataframe with NYT articles from the year 1918 contains 73161 rows and 17 columns."
## # A tibble: 78,303 x 17
##    abstract web_url snippet lead_paragraph print_page source headline_proces…
##    <chr>    <chr>   <chr>   <chr>          <chr>      <chr>  <chr>           
##  1 ""       https:… ""      ""             20         The N… "As to Fine Yar…
##  2 "The br… https:… ""      ""             18         The N… "FINANCIAL MARK…
##  3 ""       https:… ""      ""             17         The N… "Obituary 4 -- …
##  4 ""       https:… ""      ""             17         The N… "Obituary 6 -- …
##  5 "Mayor … https:… "Mayor… ""             3          The N… "START A HYLAN …
##  6 "Rev J.… https:… "Rev J… ""             8          The N… "Want Kaiser's …
##  7 "McGove… https:… "McGov… ""             14         The N… "BOXER DIES OF …
##  8 "Differ… https:… "Diffe… ""             22         The N… "CONSIDER RAIL …
##  9 ""       https:… ""      ""             14         The N… "FAST WALKERS I…
## 10 ""       https:… ""      ""             20         The N… "Drug Trading I…
## # … with 78,293 more rows, and 10 more variables: pub_date <chr>,
## #   document_type <chr>, news_desk <chr>, section_name <chr>,
## #   type_of_material <chr>, `_id` <chr>, word_count <int>, uri <chr>,
## #   print_section <chr>, text_combined <chr>
## [1] "The dataframe with NYT articles from the year 1919 contains 78303 rows and 17 columns."
# Delete residual df from environment
rm(df, i)

Exercise 5 (10 points)

Use the NYT Archive data for 1918 which you downloaded in Exercise 4 to replicate the following plot as closely as possible. Also compute/print out the total observations/texts on which each of your three density estimates is based.

Hint: Before beginning with the plot, you will first need to do some computations with the downloaded data (also see the plot annotations for clues).

# Load ggplot2
suppressMessages(library(ggplot2))

# Count the length of strings for headlines, snippets and combined texts
NYT_1918_count <- NYT_archive_1918_df %>%
  mutate(Headline = str_count(headline_processed), Snippet = str_count(snippet), combined_text_length = str_count(text_combined)) %>%
  select(Headline, Snippet, combined_text_length)

# Compute the number of headlines, snippets and combined texts that have more than zero and at most 500 characters
headlines <- NYT_1918_count %>% filter(Headline != 0 & Headline <= 500) %>% NROW()
snippets <- NYT_1918_count %>% filter(Snippet != 0 & Snippet <= 500) %>% NROW()
combined_texts <- NYT_1918_count %>% filter(combined_text_length != 0 & combined_text_length <= 500) %>% NROW()

# Pivot the dataframe to factorise the type of text and enable the construction of the plot, reorder factors, preset the colour combination
colnames(NYT_1918_count)[3] <- "Headline + snippet (combined texts)"
NYT_1918_long <- NYT_1918_count %>%
  pivot_longer(cols = everything(), names_to = "text", values_to = "length")
NYT_1918_long$text <- factor(NYT_1918_long$text, levels = c("Headline", "Snippet", "Headline + snippet (combined texts)"))
colour_combination <- c("Headline" = "red", "Snippet" = "green", "Headline + snippet (combined texts)" = "blue")

# Create the plot
NYT_plot <- ggplot() +
  geom_density(data = NYT_1918_long, aes(x = length, fill = text), alpha = 0.15) + 
  scale_x_continuous(breaks = c(0,100,200,300,400,500), limits = c(1,500)) + 
  scale_fill_manual(values = colour_combination) +
  labs(title = "Characters in NYT texts from 1918", subtitle = "Density estimates", fill = "Type of text") +
  xlab("Number of characters") + ylab("Density") +
  geom_text(aes(x=430, label="Texts with zero or more than five hundred \n characters have been excluded for this plot", y=0.002), size=2.5) +
  geom_text(aes(x=120, label=paste("Headlines:", headlines), y=0.013), hjust = 0, size=3.2) +
  geom_text(aes(x=120, label=paste("Snippets:", snippets), y=0.012), hjust = 0, size=3.2) +
  geom_text(aes(x=120, label=paste("Combined texts:", combined_texts), y=0.011), hjust = 0, size=3.2) +
  theme_minimal() + 
  theme(panel.grid = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.ticks = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "bottom",
        legend.key = element_rect(size = 0.05))

#Save plot
ggsave("NYT_plot_jsk.png", NYT_plot, path = "img", width = 8, height = 6)
## Warning: Removed 31244 rows containing non-finite values (stat_density).



Exercise 6 (8 points)

Create an SQLite database called fb-small.sqlite and connect to it with the DBI package. Store the file posts.csv in the database as a table posts. The table may only contain the original information from posts.csv, all computations in this exercise have to be done with SQL.

  1. With a single SQL query through DBI, try to replicate the outcome from Exercise 1, i.e. the query should return the same 10 highest de-meaned comment to like ratios with screen names.

  2. Assuming that the posts table was a collection in a Mongo database and each row in it a document, write a MongoDB query to compute only the average comment to like ratio for each politician (no de-meaning required here, just the average comment to like ratio for each screen_name is sufficient for this part). You can write this MongoDB query as plain text below, no need to create a Mongo database and connect to it via R.

# Load necessary packages
suppressMessages(library(DBI))
suppressMessages(library(tidyverse))

# Read in posts.csv and create SQL database, add posts table to database
posts <- read_csv("data/posts.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   screen_name = col_character(),
##   date = col_date(format = ""),
##   post_type = col_character(),
##   message = col_character(),
##   likes_count = col_double(),
##   comments_count = col_double(),
##   shares_count = col_double(),
##   love_count = col_double(),
##   haha_count = col_double(),
##   wow_count = col_double(),
##   angry_count = col_double(),
##   sad_count = col_double(),
##   gender = col_character(),
##   type = col_character(),
##   party = col_character()
## )
db <- dbConnect(RSQLite::SQLite(), "fb-small.sqlite")
dbWriteTable(db, "posts", posts, overwrite=TRUE)

# a) With a SQL query, replicate the outcome from Exercise 1, i.e. the query should return the same 10 highest de-meaned comment to like ratios with screen names.
dbGetQuery(db, 'SELECT p.screen_name,
                  (p.comments_count * 1.0 / p.likes_count) - a.avg_comment_like_ratio AS demeaned_comment_like_ratio
           FROM posts AS p
           JOIN (SELECT screen_name,
                        AVG(comments_count * 1.0 / likes_count) AS avg_comment_like_ratio
                 FROM posts
                 WHERE likes_count != 0
                 GROUP BY screen_name) AS a
             ON p.screen_name = a.screen_name
           WHERE p.likes_count != 0
           ORDER BY demeaned_comment_like_ratio DESC
           LIMIT 10')
##                      screen_name demeaned_comment_like_ratio
## 1                   susancollins                  17.569332
## 2  CongresswomanSheilaJacksonLee                  16.553564
## 3                  senatortoomey                  14.747682
## 4               RepClaudiaTenney                  10.893137
## 5                   RepJoeBarton                   9.889349
## 6                   SenDuckworth                   8.769668
## 7                  RepMikeBishop                   8.597682
## 8                   AndyHarrisMD                   8.514349
## 9                 mitchmcconnell                   8.431015
## 10                     RepMullin                   7.847682
# Disconnect from Database
dbDisconnect(db)

MongoDB query for the average comment to like ratio of each politician: collection$aggregate('[{"$match": {"likes_count": {"$ne": 0}}}, {"$project": {"_id": false, "screen_name": true, "comment_like_ratio": {"$divide": ["$comments_count", "$likes_count"]}}}, {"$group": {"_id": "$screen_name", "avg_comment_like_ratio": {"$avg": "$comment_like_ratio"}}}]')

Exercise 7 (22 points)

In this exercise, the goal is to use two APIs to download data, to create a relational database covid.sqlite, and to visualise and discuss developments and properties of the current pandemic that you find interesting and noteworthy. You can use the covid and policy API from the University of Oxford https://covidtracker.bsg.ox.ac.uk/about-api and the Athena API from the World Health Organization https://www.who.int/data/gho/info/athena-api to download data. You can e.g. use the Oxford API to download time series of cases, deaths, or policy strengths, and the WHO Athena API to download other relevant information such as statistics and health indicators. For both APIs there exists plenty of documentation and examples on their websites.

Once you have downloaded your data from the APIs through code in R, the next step is to create a new SQLite database covid.sqlite and to store the data in two tables such that it avoids unnecessary redundancies (also see Hint 1 below). If important for your data organisation and analysis, you can add more tables unless they create redundancies; however, only the following two are required and the rest is an addition.

country_code  country_name    population  another indicator
UK            United Kingdom  42          42
FR            France          42          42

country_code  date        cases  deaths
UK            2020-01-01  0      0
UK            2020-01-02  0      0
UK            2020-01-03
  1. Study the two APIs and obtain data from them (through R) that you find interesting and need for your later analysis. After you processed the data in R, create a database and store the distinct tables in it (no need to store any data in the database that you do not need in the later analysis).

  2. Run SQLs queries which return the first five rows and all columns of your tables (to give the reader a preview of what you have collected). Also run SQL queries to return the total amount of rows of your tables.

  3. Run a query (or queries) with a join command to show that it is possible to join your tables. Return only the first five rows of the joined table, and also return the total number of rows in the joined table with a query.

  4. Visualise the data you collected with ggplot2 and feel free to also discuss it with the help of other approaches and packages if helpful. Note that once you have queried the relevant data for your analysis from the database you created in part c), you can do all subsequent analysis in R for this part d). This reflects how data is typically stored in databases: after the relevant parts have been extracted for an analysis, they are usually processed further in software such as R.

Hint 1: The previous SQLite database in Exercise 6 contained a single table only because the purpose of the exercise was to run one query. Actual relational databases contain multiple tables by definition. Recall that the reason why data is split into different tables in relational databases is that these databases are meant to store data. Data which is split into different tables avoids redundancies and is thereby more space efficient for storage. Merging the above two tables and storing only a single table in the database, for example, would have repeated country specific indicators in every daily/monthly row and hence taken up unnecessary storage space. Also note that, e.g., the country name is only stored in the country table in the example, as it is unnecessary to store it in the time series table as well (these tables can simply be joined on the country code whenever necessary). When analysing or visualising data, on the other hand, the relevant information will likely be joined and extracted from the database. Following this logic, when storing data in your database, avoid redundant information where possible. In part d), however, extract data frames from the database as the basis for your subsequent processing and analysis with ggplot2 or other packages.
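The storage logic of this hint can be sketched with DBI: country-level columns are stored once in one table and joined in only when needed. Table and column names below mirror the illustrative example tables above (they are not the ones required by the exercise), and an in-memory database is used for the sketch.

```r
library(DBI)

# In-memory database for the sketch (the exercise uses a covid.sqlite file instead)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# One table with country-level indicators, one with the daily time series
countries <- data.frame(country_code = c("UK", "FR"),
                        country_name = c("United Kingdom", "France"),
                        population   = c(42, 42))
timeseries <- data.frame(country_code = c("UK", "UK"),
                         date  = c("2020-01-01", "2020-01-02"),
                         cases = c(0, 0))

dbWriteTable(con, "countries", countries)
dbWriteTable(con, "timeseries", timeseries)

# The country name is not repeated in every daily row; it is joined in on demand
joined <- dbGetQuery(con, "SELECT t.country_code, t.date, t.cases, c.country_name
                           FROM timeseries AS t
                           JOIN countries AS c USING (country_code)")
dbDisconnect(con)
```

The same join, run against the real covid.sqlite tables, would cover part c) of the exercise.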

Hint 2: Your tables need a column on which they can be merged. If the different APIs return different country codes, you could e.g. use the countrycode package or equivalent to obtain standardised country codes.
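For instance (a minimal sketch with made-up inputs), the countrycode package converts between code schemes:

```r
library(countrycode)

# Convert ISO2 country codes to standardised ISO3 codes
countrycode(c("DE", "GB"), origin = "iso2c", destination = "iso3c")
## [1] "DEU" "GBR"
```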

Hint 3: If you choose to depict time series plots and the daily data of cases is too noisy, one approach is to use moving averages or other smoothed time series. In e.g. a 7 day moving average, the value of a given day can be computed as the average of the last six days and the current day.
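With the zoo package (used later in this document), this trailing convention corresponds to align = "right"; a minimal sketch:

```r
library(zoo)

# Trailing 7-value average: each value averages the current day and the six before it
rollmean(c(1, 2, 3, 4, 5, 6, 7, 8), 7, align = "right", fill = NA)
## [1] NA NA NA NA NA NA  4  5
```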

Hint 4: The data returned by the Oxford API can become large as it contains information on many countries and days. The WHO API also includes many potential indicators and time series. Just focus on using and storing information on those countries and variables that you have decided to use in your later analysis.

# a) Study the two APIs and obtain data from them (__through R__) that you find interesting and need for your later analysis. After processing the data in R, create a database and store the distinct tables in it (no need to store any data in the database that you do not need in the later analysis).

# Load packages: oxcgrt (Oxford Covid Tracker API), countrycode (standardised
# country codes), zoo (rolling averages), plotly (interactive charts),
# hrbrthemes (theme_ipsum), jsonlite/stringr/tidyr (WHO API helper and drop_na),
# DBI/RSQLite (database), and ggplot2/maps (plots and world map data)
suppressMessages(library(oxcgrt))
suppressMessages(library(countrycode))
suppressMessages(library(zoo))
suppressMessages(library(plotly))
suppressMessages(library(hrbrthemes))
suppressMessages(library(jsonlite))
suppressMessages(library(stringr))
suppressMessages(library(tidyr))
suppressMessages(library(DBI))
suppressMessages(library(RSQLite))
suppressMessages(library(ggplot2))
suppressMessages(library(maps))

# Get data from 2nd of January 2020 until today
query <- get_json_time(from = "2020-01-02")
ox_covid_data <- get_data_time(query)

# Reformat country names to make them mappable
ox_covid_data$country_name[ox_covid_data$country_name == "United States"] <- "USA"
ox_covid_data$country_name[ox_covid_data$country_name == "United Kingdom"] <- "UK"
ox_covid_data$country_name[ox_covid_data$country_name == "Congo - Kinshasa"] <- "Democratic Republic of the Congo"
ox_covid_data$country_name[ox_covid_data$country_name == "Congo - Brazzaville"] <- "Republic of Congo"
ox_covid_data$country_name[ox_covid_data$country_name == "Czechia"] <- "Czech Republic"

# Convert date_value to character so dates remain queryable in SQLite (which has no native date type)
ox_covid_data_sql <- ox_covid_data
ox_covid_data_sql$date_value <- as.character(ox_covid_data_sql$date_value)

# This function formats outputs of the WHO API into one data frame per indicator
who_api_to_dataframe <- function(indicator_codes){
  # Collect the formatted data frames here
  ls_who_dfs <- list()
  i <- 1
  
  for(code in indicator_codes){
    # Build the query URL
    query <- sprintf("http://apps.who.int:80/gho/athena/api/GHO/%s?format=json", code)
    # Parse data from JSON into list
    who_available <- fromJSON(query)
    # Extract relevant df from list
    df <- as.data.frame(who_available[["fact"]])
    df[,"year"] <- NA
    df[,"country_code"] <- NA
    df[,"country_name"] <- NA
    df$value <- df$value$numeric
    
    # Use j as the row index: reusing i would clobber the list index above
    for(j in seq_len(nrow(df))){
      # Year
      country_year <- str_extract(unlist(df[[6]][[j]]$code), "[0-9]{4}")
      year <- country_year[!is.na(country_year)]
      df[j, "year"] <- year
      
      # Country code
      country_plus_code <- filter(df$Dim[[j]], category == "COUNTRY")
      country_code <- country_plus_code$code[1]
      df[j, "country_code"] <- country_code
      
    }
    
    # Add country name, select relevant columns, rename value column
    df$country_name <- countrycode(sourcevar = df$country_code, origin = "iso3c", destination = "country.name")
    df <- df %>% select(country_code, country_name, year, value)
    # Source for this: https://community.rstudio.com/t/pass-a-variable-to-dplyr-rename-to-change-columnname/6907/2
    df <- rename(df, !!code := value)
    ls_who_dfs[[i]] <- df
    i <- i + 1
    assign(paste(code), df, envir = .GlobalEnv)
  }
  assign("ls_who_dfs", ls_who_dfs, envir = .GlobalEnv)
}

# Set up indicators to query from WHO API
# WHS9_86 = Population (in thousands) total 
# WHS6_102 = Hospital Beds per 10k
indicators <- c("WHS9_86", "WHS6_102")

# Set up and format dataframes from WHO API
who_api_to_dataframe(indicators)

# Reduce dataframes to only the countries present in the dataframes of all indicators
country_vector_first <- ls_who_dfs[[1]]$country_name[!is.na(ls_who_dfs[[1]]$country_name)]
country_vector_second <- ls_who_dfs[[2]]$country_name[!is.na(ls_who_dfs[[2]]$country_name)]
residual_countries <- intersect(country_vector_first, country_vector_second)

## Build a single "countries" dataframe: for each indicator keep a) only the countries present in all dataframes and b) the most recent year of recorded data, then merge the dataframes into one

# Transformation into one "countries" dataframe
initial <- ls_who_dfs[[1]] %>%
  group_by(country_name) %>%
  filter(country_name %in% residual_countries) %>%
  filter(year < 2021) %>%
  filter(year == max(year)) %>%
  arrange(country_name)
countries <- initial

for(i in ls_who_dfs[2:length(ls_who_dfs)]){
  i <- i %>%
    group_by(country_name) %>%
    filter(country_name %in% residual_countries) %>%
    filter(year < 2021) %>%
    filter(year == max(year)) %>%
    arrange(country_name)
  countries <- cbind(countries, i)
}
## New names:
## * country_code -> country_code...1
## * country_name -> country_name...2
## * year -> year...3
## * country_code -> country_code...5
## * country_name -> country_name...6
## * ...
# Rename columns to ease working with this df
names(countries)[1] <- "country_code"
names(countries)[2] <- "country_name"
names(countries)[3] <- "year_WHS9_86"
names(countries)[7] <- "year_WHS6_102"

# Select columns of interest
countries <- countries %>%
  select(country_code, country_name, year_WHS9_86, WHS9_86, year_WHS6_102, WHS6_102)
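As an aside, the cbind() merge above duplicates the key columns (hence the automatic column renaming). A key-based join avoids this; a hedged sketch, assuming the same filtering to residual countries and the most recent year has been applied to each indicator dataframe first:

```r
# Sketch only: join the two indicator dataframes on their shared keys,
# keeping each indicator's year column under a distinct name
countries_joined <- ls_who_dfs[[1]] %>%
  rename(year_WHS9_86 = year) %>%
  inner_join(ls_who_dfs[[2]] %>% rename(year_WHS6_102 = year),
             by = c("country_code", "country_name"))
```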

# Check the column names
colnames(countries)
## [1] "country_code"  "country_name"  "year_WHS9_86"  "WHS9_86"      
## [5] "year_WHS6_102" "WHS6_102"
# a) Set up SQL Database
db_covid <- dbConnect(RSQLite::SQLite(), "covid_analysis.sqlite")
dbWriteTable(db_covid, "time_series", ox_covid_data_sql, overwrite = TRUE)
dbWriteTable(db_covid, "countries", countries, overwrite = TRUE)

# List tables of SQL database
dbListTables(db_covid)
## [1] "countries"   "time_series"
# b) Run SQL queries which return the first five rows and all columns of your tables (to give the reader a preview of what you have collected). Also run SQL queries to return the total number of rows of your tables.

# First five rows and count of rows of time_series  table
dbGetQuery(db_covid, 'SELECT *
           FROM time_series
           LIMIT 5')
##   date_value country_code country_name stringency_actual stringency
## 1 2020-01-02          ABW        Aruba                 0          0
## 2 2020-01-02          AFG  Afghanistan                 0          0
## 3 2020-01-02          AGO       Angola                 0          0
## 4 2020-01-02          ALB      Albania                 0          0
## 5 2020-01-02          AND      Andorra                 0          0
##   stringency_legacy stringency_legacy_disp confirmed deaths
## 1                 0                      0        NA     NA
## 2                 0                      0        NA     NA
## 3                 0                      0        NA     NA
## 4                 0                      0        NA     NA
## 5                 0                      0        NA     NA
dbGetQuery(db_covid, 'SELECT COUNT(*) AS Number_of_rows
           FROM time_series')
##   Number_of_rows
## 1          69038
# First five rows and count of rows of countries table
dbGetQuery(db_covid, 'SELECT *
           FROM countries
           LIMIT 5')
##   country_code      country_name year_WHS9_86 WHS9_86 year_WHS6_102 WHS6_102
## 1          AFG       Afghanistan         2016   34656          2017  3.90000
## 2          ALB           Albania         2016    2926          2013 28.89300
## 3          DZA           Algeria         2016   40606          2015 19.00000
## 4          AGO            Angola         2016   28813          2005  8.00000
## 5          ATG Antigua & Barbuda         2016     101          2017 28.91845
dbGetQuery(db_covid, 'SELECT COUNT(*) AS Number_of_rows
           FROM countries')
##   Number_of_rows
## 1            176
# c) Run a query (or queries) with a join command to show that it is possible to join your tables. Return __only__ the first five rows of the joined table, and also return the total number of rows in the joined table with a query.
dbGetQuery(db_covid, 'SELECT time_series.country_code, countries.country_code, time_series.country_name, time_series.confirmed, time_series.date_value, countries.WHS6_102
           FROM time_series JOIN countries
           ON time_series.country_code = countries.country_code
           WHERE time_series.confirmed > 2000
           LIMIT 5')
##   country_code country_code country_name confirmed date_value WHS6_102
## 1          CHN          CHN        China      2062 2020-01-26     43.1
## 2          CHN          CHN        China      2863 2020-01-27     43.1
## 3          CHN          CHN        China      5494 2020-01-28     43.1
## 4          CHN          CHN        China      6070 2020-01-29     43.1
## 5          CHN          CHN        China      8124 2020-01-30     43.1
# Relative to the time_series table (69,038 rows), the join returns 60,793 rows; the reduction of 8,245 rows is likely due to country codes that exist in the time_series table but not in the countries table
dbGetQuery(db_covid, 'SELECT COUNT(*) AS Number_of_rows
           FROM time_series JOIN countries
           ON time_series.country_code = countries.country_code')
##   Number_of_rows
## 1          60793
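To verify which codes account for the dropped rows, a diagnostic query (a sketch) can list the country codes that appear in time_series but have no match in countries:

```r
# LEFT JOIN keeps all time_series rows; unmatched ones have NULL country keys
dbGetQuery(db_covid, 'SELECT DISTINCT time_series.country_code
           FROM time_series LEFT JOIN countries
           ON time_series.country_code = countries.country_code
           WHERE countries.country_code IS NULL')
```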
# Disconnect from database
dbDisconnect(db_covid)

# d) Visualise the data you collected with `ggplot2` and feel free to also discuss it with the help of other approaches and packages if helpful.

# Compute population and hospital bed figures for Germany and the UK
# WHS9_86 is population in thousands; WHS6_102 is hospital beds per 10k people
de_pop <- countries$WHS9_86[countries$country_name == "Germany"]*1000
# Total beds = beds per 10k * population / 10,000
de_beds <- countries$WHS6_102[countries$country_name == "Germany"] * de_pop / 10000
de_beds_rel_1k <- countries$WHS6_102[countries$country_name == "Germany"]/10
de_beds_rel_100k <- countries$WHS6_102[countries$country_name == "Germany"]*10
de_beds_rel_1m <- countries$WHS6_102[countries$country_name == "Germany"]*100
uk_pop <- countries$WHS9_86[countries$country_name == "United Kingdom"]*1000
uk_beds <- countries$WHS6_102[countries$country_name == "United Kingdom"] * uk_pop / 10000
uk_beds_rel_1k <- countries$WHS6_102[countries$country_name == "United Kingdom"]/10
uk_beds_rel_100k <- countries$WHS6_102[countries$country_name == "United Kingdom"]*10
uk_beds_rel_1m <- countries$WHS6_102[countries$country_name == "United Kingdom"]*100

# Modify the Oxford covid data for Germany and the UK
# Source for rolling averages: https://datavizpyr.com/how-to-make-time-series-plot-with-rolling-mean-in-r/
# align = "right" computes the trailing average over the current day and the six preceding days (cf. Hint 3)
germany_covid <- ox_covid_data[ox_covid_data$country_name == "Germany",] %>%
  mutate(confirmed_seven_average = rollmean(confirmed, 7, align = "right", fill = 0)) %>%
  mutate(deaths_seven_average = rollmean(deaths, 7, align = "right", fill = 0)) %>%
  mutate(daily_new_cases = confirmed - lag(confirmed)) %>%
  mutate(daily_new_deaths = deaths - lag(deaths)) %>%
  mutate(daily_new_cases_per_100k = (100000/de_pop) * daily_new_cases) %>%
  drop_na()
  
uk_covid <- ox_covid_data[ox_covid_data$country_code == "GBR",] %>%
  mutate(confirmed_seven_average = rollmean(confirmed, 7, align = "right", fill = 0)) %>%
  mutate(deaths_seven_average = rollmean(deaths, 7, align = "right", fill = 0)) %>%
  mutate(daily_new_cases = confirmed - lag(confirmed)) %>%
  mutate(daily_new_deaths = deaths - lag(deaths)) %>%
  mutate(daily_new_cases_per_100k = (100000/uk_pop) * daily_new_cases) %>%
  drop_na()
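The lag-differencing used above turns a cumulative series into daily counts; a minimal sketch (the first value has no predecessor, hence NA, which drop_na() removes):

```r
library(dplyr)

cumulative <- c(10, 15, 23, 30)
cumulative - lag(cumulative)
## [1] NA  5  8  7
```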

# Disable scientific notation to ease readability of plots
options(scipen=999)

## Create plots ##

# uk_ger_covid_case_avg_pop
uk_ger_covid_case_avg_pop <- ggplot() +
  geom_line(data = germany_covid, aes(x = date_value, y = confirmed_seven_average/de_pop), colour = "red", alpha = 0.75) +
  geom_line(data = uk_covid, aes(x = date_value, y = confirmed_seven_average/uk_pop), colour = "blue", alpha = 0.75) +
  scale_x_continuous(breaks = as.Date(c('2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Apr 2020", "Jun 2020", "Aug 2020", "Oct 2020", "Dec 2020"), limits = as.Date(c('2020-03-06','2021-01-08'))) + 
  labs(title = "Cumulative Covid-19 Cases / Population Size in GER and UK", subtitle = "Rolling average over 7 days") +
  xlab("") + ylab("Cumulative confirmed cases / Population Size") +
  geom_text(data = germany_covid, aes(x = date_value[350], label="GER", y =0.025), size=3) + 
  geom_text(data = uk_covid, aes(x = date_value[350], label="GBR", y = 0.049), size=3) +
  theme_ipsum() +
  theme(plot.title.position = "plot")

# uk_ger_covid_cases
uk_ger_covid_cases <- ggplot() +
  geom_line(data = germany_covid, aes(x = date_value, y = confirmed_seven_average), colour = "red", alpha = 0.75) +
  geom_line(data = uk_covid, aes(x = date_value, y = confirmed_seven_average), colour = "blue", alpha = 0.75) +
  scale_x_continuous(breaks = as.Date(c('2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Apr 2020", "Jun 2020", "Aug 2020", "Oct 2020", "Dec 2020"), limits = as.Date(c('2020-03-06','2021-01-08'))) + 
  scale_y_continuous(breaks = c(0, 500000, 1000000, 1500000, 2000000, 2500000, 3000000), labels=c(0, "500K", "1M", "1.5M", "2M", "2.5M", "3M")) + 
  labs(title = "Cumulative confirmed Covid-19 cases in Germany and the UK", subtitle = "Rolling average over 7 days") +
  xlab("") + ylab("Cumulative confirmed cases") +
  geom_text(data = germany_covid, aes(x = date_value[350], label="GER", y =2050000), size=3) + 
  geom_text(data = uk_covid, aes(x = date_value[350], label="GBR", y = 3150000), size=3) +
  theme_ipsum() +
  theme(plot.title.position = "plot")

# uk_ger_covid_deaths_avg_pop
uk_ger_covid_deaths_avg_pop <- ggplot() +
  geom_line(data = germany_covid, aes(x = date_value, y = deaths_seven_average/de_pop), colour = "red", alpha = 0.75) +
  geom_line(data = uk_covid, aes(x = date_value, y = deaths_seven_average/uk_pop), colour = "blue", alpha = 0.75) +
  scale_x_continuous(breaks = as.Date(c('2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Apr 2020", "Jun 2020", "Aug 2020", "Oct 2020", "Dec 2020"), limits = as.Date(c('2020-03-06','2021-01-08'))) + 
  labs(title = "Cumulative Covid-19 Deaths / Population Size in GER and UK", subtitle = "Rolling average over 7 days") +
  xlab("") + ylab("Cumulative confirmed deaths / Population Size") +
  geom_text(data = germany_covid, aes(x = date_value[350], label="GER", y =0.00054), size=3) + 
  geom_text(data = uk_covid, aes(x = date_value[350], label="GBR", y = 0.0013), size=3) +
  theme_ipsum(axis_title_just = "rt") +
  theme(plot.title.position = "plot")

# uk_ger_covid_deaths_avg
uk_ger_covid_deaths <- ggplot() +
  geom_line(data = germany_covid, aes(x = date_value, y = deaths_seven_average), colour = "red", alpha = 0.75) +
  geom_line(data = uk_covid, aes(x = date_value, y = deaths_seven_average), colour = "blue", alpha = 0.75) +
  scale_x_continuous(breaks = as.Date(c('2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Apr 2020", "Jun 2020", "Aug 2020", "Oct 2020", "Dec 2020"), limits = as.Date(c('2020-03-06','2021-01-08'))) + 
  scale_y_continuous(breaks = c(0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000), labels=c(0, "10k", "20K", "30K", "40K", "50K", "60K", "70K", "80K")) + 
  labs(title = "Cumulative Covid-19 deaths in Germany and the UK", subtitle = "Rolling average over 7 days") +
  xlab("") + ylab("Cumulative confirmed deaths") +
  geom_text(data = germany_covid, aes(x = date_value[350], label="GER", y =45000), size=3) + 
  geom_text(data = uk_covid, aes(x = date_value[350], label="GBR", y = 84000), size=3) +
  theme_ipsum(axis_title_just = "rt") +
  #theme_economist(dkpanel = TRUE) + 
  theme(plot.title.position = "plot")

# uk_ger_daily_new_cases_per_100k
uk_ger_daily_new_cases_per_100k <- ggplot() +
  geom_col(data = germany_covid, aes(x = date_value, y = -daily_new_cases_per_100k), fill = "red", alpha = 0.75) +
  geom_col(data = uk_covid, aes(x = date_value, y = daily_new_cases_per_100k), fill = "blue", alpha = 0.75) +
  scale_y_continuous(breaks = c(-70, -60, -50, -40, -30, -20, -10, 0, 10, 20, 30, 40, 50, 60, 70), labels = c("70", "60", "50", "40", "30", "20", "10", 0, "10", "20", "30", "40", "50", "60", "70")) +
  scale_x_continuous(breaks = as.Date(c('2020-02-01','2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Feb 2020", "April 2020", "June 2020", "August 2020", "October 2020", "December 2020"), limits = as.Date(c('2020-02-20','2021-01-08'))) +
  labs(title = "Daily new confirmed Covid-19 cases \nper 100k people in Germany and the UK") + 
  xlab("") + ylab("Daily new cases per 100k") +
  geom_text(data = germany_covid, aes(x = date_value[45], label="GBR", y =12), size=5) + 
  geom_text(data = uk_covid, aes(x = date_value[45], label="GER", y = -12), size=5) +
  geom_hline(aes(yintercept = -de_beds_rel_1k),  alpha = 0.2, colour = "red", linetype = "dashed") +
  annotate(geom = "text", x = germany_covid$date_value[225], y = -de_beds_rel_1k-2, label = paste("Hospital Beds\nper 1,000:",de_beds_rel_1k), color = "red", angle = 0, hjust = 1) +
  geom_hline(aes(yintercept = uk_beds_rel_1k), alpha = 0.2, colour = "blue", linetype = "dashed") +
  annotate(geom = "text", x = germany_covid$date_value[225], y = uk_beds_rel_1k+2, label = paste("Hospital Beds \nper 1,000:",uk_beds_rel_1k), color = "blue", angle = 0, hjust = 0) +
  theme_ipsum(axis_title_just = "rm") +
  #theme_economist(dkpanel = TRUE) + 
  theme(plot.title.position = "plot") + 
  coord_flip()

# uk_ger_daily_new_deaths
uk_ger_daily_new_deaths <- ggplot() +
  geom_col(data = germany_covid, aes(x = date_value, y = -daily_new_deaths), fill = "red", alpha = 0.75) +
  geom_col(data = uk_covid, aes(x = date_value, y = daily_new_deaths), fill = "blue", alpha = 0.75) +
  scale_x_continuous(breaks = as.Date(c('2020-02-01','2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Feb 2020", "April 2020", "June 2020", "August 2020", "October 2020", "December 2020"), limits = as.Date(c('2020-03-06','2021-01-08'))) +
  scale_y_continuous(breaks = c(-1000, -750, -500, -250, 0, 250, 500, 750, 1000), labels = c(1000, 750, 500, 250, 0, 250, 500, 750, 1000)) + 
  labs(title = "Daily new confirmed Covid-19 deaths in Germany and the UK") +
  xlab("") + ylab("Daily new deaths") +
  geom_text(data = germany_covid, aes(x = germany_covid$date_value[55], label="GBR", y =140), size=5) + 
  geom_text(data = uk_covid, aes(x = date_value[55], label="GER", y = -140), size=5) +
  theme_ipsum(axis_title_just = "rm") +
  #theme_economist(dkpanel = TRUE) + 
  theme(plot.title.position = "plot") + 
  coord_flip()

# Save plots
ggsave("uk_ger_covid_case.png", uk_ger_covid_cases, path = "img", width = 8, height = 6)
ggsave("uk_ger_covid_case_avg_pop.png", uk_ger_covid_case_avg_pop, path = "img", width = 8, height = 6)
ggsave("uk_ger_covid_deaths.png", uk_ger_covid_deaths, path = "img", width = 8, height = 6)
ggsave("uk_ger_covid_deaths_avg_pop.png", uk_ger_covid_deaths_avg_pop, path = "img", width = 8, height = 6)
ggsave("uk_ger_daily_new_deaths.png", uk_ger_daily_new_deaths, path = "img", width = 8, height = 6)
ggsave("uk_ger_daily_new_cases_per_100k.png", uk_ger_daily_new_cases_per_100k, path = "img", width = 8, height = 6)

## Make maps of global stringency measures for selected months: January 2020 (pre-pandemic), April (first wave), July (summer, small wave), October (second large wave), December (later stage of second large wave), and January 2021
#Adjust dataframe for January 2020
ox_covid_data_map_january <- ox_covid_data %>%
  filter(date_value == as.Date("2020-01-02")) %>%
  select(country_code, country_name, stringency_actual)

#Adjust dataframe for April 2020
ox_covid_data_map_april <- ox_covid_data %>%
  filter(date_value == as.Date("2020-04-01")) %>%
  select(country_code, country_name, stringency_actual)

#Adjust dataframe for July 2020
ox_covid_data_map_july <- ox_covid_data %>%
  filter(date_value == as.Date("2020-07-01")) %>%
  select(country_code, country_name, stringency_actual)

#Adjust dataframe for October 2020
ox_covid_data_map_october <- ox_covid_data %>%
  filter(date_value == as.Date("2020-10-01")) %>%
  select(country_code, country_name, stringency_actual)

#Adjust dataframe for December 2020
ox_covid_data_map_december <- ox_covid_data %>%
  filter(date_value == as.Date("2020-12-01")) %>%
  select(country_code, country_name, stringency_actual)

#Adjust dataframe for January 2021
ox_covid_data_map_january_2021 <- ox_covid_data %>%
  filter(date_value == as.Date("2021-01-08")) %>%
  select(country_code, country_name, stringency_actual)

# Create a data frame with global map data
map.data <- map_data("world")

# Map January
map_stringency_january <- ggplot(ox_covid_data_map_january, aes(map_id = country_name)) +
  geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0,100)) + 
  coord_fixed(1.3) +
  labs(title = "Global Stringency Measures in January 2020") +
  # Removing unnecessary graph elements
       theme(axis.line = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())

# Map April
map_stringency_april <- ggplot(ox_covid_data_map_april, aes(map_id = country_name)) +
  geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0,100)) + 
  coord_fixed(1.3) +
  labs(title = "Global Stringency Measures in April 2020") +
  # Removing unnecessary graph elements
       theme(axis.line = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())

# Map July
map_stringency_july <- ggplot(ox_covid_data_map_july, aes(map_id = country_name)) +
  geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0,100)) + 
  coord_fixed(1.3) +
  labs(title = "Global Stringency Measures in July 2020") +
  # Removing unnecessary graph elements
       theme(axis.line = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())


# Map October
map_stringency_october <- ggplot(ox_covid_data_map_october, aes(map_id = country_name)) +
  geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0,100)) +
  coord_fixed(1.3) +
  labs(title = "Global Stringency Measures in October 2020") +
  # Removing unnecessary graph elements
       theme(axis.line = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())

# Map December
map_stringency_december <- ggplot(ox_covid_data_map_december, aes(map_id = country_name)) +
  geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0,100)) + 
  coord_fixed(1.3) +
  labs(title = "Global Stringency Measures in December 2020") +
  # Removing unnecessary graph elements
       theme(axis.line = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())

# Map January 2021
map_stringency_january_2021 <- ggplot(ox_covid_data_map_january_2021, aes(map_id = country_name)) +
  geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0,100)) + 
  coord_fixed(1.3) +
  labs(title = "Global Stringency Measures in January 2021") +
  # Removing unnecessary graph elements
       theme(axis.line = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())
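As a possible refactor, the six map blocks above differ only in the data frame and the title; a small helper (a sketch, using theme_void() to approximate the element_blank() theme used above) would avoid the repetition:

```r
make_stringency_map <- function(df, title){
  ggplot(df, aes(map_id = country_name)) +
    geom_map(aes(fill = stringency_actual), map = map.data, color = "white", size = 0.25) +
    expand_limits(x = map.data$long, y = map.data$lat) +
    scale_fill_gradient(low = "#56B1F7", high = "#132B43", limits = c(0, 100)) +
    coord_fixed(1.3) +
    labs(title = title) +
    theme_void()
}

# e.g. map_stringency_april <- make_stringency_map(ox_covid_data_map_april,
#                                                  "Global Stringency Measures in April 2020")
```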


#Save the maps
ggsave("map_stringency_january.png", map_stringency_january, path = "img", width = 8, height = 5)
ggsave("map_stringency_april.png", map_stringency_april, path = "img", width = 8, height = 5)
ggsave("map_stringency_july.png", map_stringency_july, path = "img", width = 8, height = 5)
ggsave("map_stringency_october.png", map_stringency_october, path = "img", width = 8, height = 5)
ggsave("map_stringency_december.png", map_stringency_december, path = "img", width = 8, height = 5)
ggsave("map_stringency_january_2021.png", map_stringency_january_2021, path = "img", width = 8, height = 5)


# Create response failure rating, taking into account stringency measures and hospital beds available
# !!!This is an exploratory visualization!!!
germany_covid <- germany_covid %>% mutate(response_failure = daily_new_cases_per_100k / de_beds * stringency_actual)
uk_covid <- uk_covid %>% mutate(response_failure = daily_new_cases_per_100k / uk_beds * stringency_actual)

# uk_ger_response_plot
uk_ger_response_plot <- ggplot() +
  geom_line(data = germany_covid, aes(x = date_value, y = response_failure), colour = "red", alpha = 0.75) +
  geom_line(data = uk_covid, aes(x = date_value, y = response_failure), colour = "blue", alpha = 0.75) +
  scale_x_continuous(breaks = as.Date(c('2020-02-01','2020-04-01', '2020-06-01', '2020-08-01', '2020-10-01', '2020-12-01')), labels = c("Feb 2020", "April 2020", "June 2020", "August 2020", "October 2020", "December 2020"), limits = as.Date(c('2020-02-20','2021-01-08'))) +
  labs(title = "Response Failure Rating in GER and UK", subtitle = "Daily new confirmed Covid-19 cases per 100k people / hospital beds * stringency index") + 
  xlab("Date and Year") + ylab("Response Rating") +
  geom_text(data = germany_covid, aes(x = date_value[352], label="GER", y =0.001), size=3) + 
  geom_text(data = uk_covid, aes(x = date_value[352], label="GBR", y = 0.0335), size=3) +
  theme_ipsum()

ggsave("uk_ger_response_plot.png", uk_ger_response_plot, path = "img", width = 8, height = 5)

In a direct comparison, the UK and Germany appear to have faced a very similar Covid-19 situation throughout 2020. However, since the discovery of the more rapidly spreading Covid-19 variant, the UK appears to face a much more dire situation than Germany (source here). The following set of visualisations gives insight into the distinct situations of Germany and the UK and closes with an overview of the global response to Covid-19 over time.



At the beginning of the pandemic in 2020, cases in Germany and the UK accumulated at visually similar rates. However, after adjusting these numbers for population size (Germany: 81,915,000 | UK: 65,789,000), it becomes clear that the UK faces considerably more cases as a share of its population. The graphs “Cumulative Cases / Population Size in GER and UK” and “Cumulative Deaths / Population Size in GER and UK” below showcase this.



Although the UK recorded significantly more deaths in the first large wave of this pandemic (see graph below), the slope at which deaths accumulated at the end of 2020 appears similar again for both countries (graph above). Nonetheless, in light of the newly discovered Covid-19 variant B.1.1.7 that spreads 74% faster, cases in the UK have recently been surging more heavily, causing a dramatic surge in deaths that is not included in this dataset (which reaches until the 8th of January) but is currently observable in the government's daily reports (source here).



Furthermore, including another variable in this analysis helps to display the severity of the pandemic situation in the UK. The capacity of the UK's healthcare system (which I approximate with hospital beds available relative to population size) is far inferior to that of Germany, making the increased number of cases and deaths in the UK relative to Germany an even more significant problem; this helps explain overflowing hospitals and the newly introduced lockdown until the 15th of February. The graph below illustrates this well.



Lastly, a word about the stringency measures. When looking at a map of all countries over time, the timeline of the pandemic is largely observable: no government restrictions in January, significant global restrictions in April, relaxing restrictions in July, and then, interestingly (given the magnitude of the second wave that hit Europe in particular in October), only a marginal increase in restrictions in October and December, and again only marginal increases in a few countries in January 2021 (to be seen below). One explanation might be the increasing pressure on countries' economic performance and security, forcing governments to balance economic activity against public health protection measures.



With respect to Germany and the UK, I constructed a basic index (the response failure index) that takes into account daily confirmed cases, stringency measures (i.e. the severity of government restrictions), and hospital beds (to approximate the capacity of the healthcare system). For each date, the daily confirmed cases per 100k people are divided by the number of hospital beds and multiplied by the stringency index of the respective country on that day. This again shows the difference in the severity of the pandemic situation between Germany and the United Kingdom; in particular, the recent surge of the index in the UK illustrates the poorly coordinated response to the more infectious variant of the virus.
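As a toy illustration with entirely made-up numbers (20 daily new cases per 100k people, 80,000 hospital beds, a stringency index of 75), the index evaluates to:

```r
# Hypothetical values, purely illustrative
20 / 80000 * 75
## [1] 0.01875
```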



Exercise 8 (18 points)

In 1918 and 1919, the world had to cope with three waves of a global pandemic, the Spanish Flu. The attached figure shows historical infections in New York, Berlin, and Paris. With the NYT Archive and the two data frames for 1918 and 1919 that you downloaded with your function from Exercise 4, you have the chance to study the “influenza” and its effect on societies at that time, public discussion about it, measures taken against it, etc. In this exercise, explore, discuss, and visualise information from the archive about this historical pandemic (e.g. with time series plots of word counts, with word clouds of selected articles, by depicting articles, or any other approach). For this you can use quanteda, the tidyverse, or any other package. For example, do you find discussion of similar policy measures as the ones taken at the moment? Were there articles early in 1918 that seem to have underestimated the threat? Feel free to explore whatever question and topics about the pandemic you find most interesting. You can use all data from the NYT archive of these years and also use further data sources with references if helpful.



Image source: https://en.wikipedia.org/wiki/Spanish_flu

Hint 1: When initially exploring the downloaded data for 1918 and 1919, you can also have a look at the .csv files/search through them in Excel or similar programs. If the files are too large and slow to load, you can delete some columns that you do not need for your analysis in R before saving the .csv files.
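Hint 1's column-pruning suggestion can be sketched as follows (the file path and the retained column names are illustrative; keep whichever columns your analysis needs):

```r
# Keep only the columns needed for the analysis before re-saving, so the
# .csv file is smaller and loads faster
library(dplyr)

slim_archive <- function(df) {
  df %>% select(pub_date, headline_processed, abstract, text_combined, web_url)
}

# Hypothetical usage:
# df_1918 <- read.csv("data/NYT_archive_1918.csv", stringsAsFactors = FALSE)
# write.csv(slim_archive(df_1918), "data/NYT_archive_1918_slim.csv", row.names = FALSE)
```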

Hint 2: As the free archive only returns articles up to one paragraph length, you can use the texts_combined column in your analysis when helpful if you want to make use of more available textual information. You can also have a look at the abstract column which contains some more texts during the years 1918 and 1919 than the snippet column.

# Baseline comparison of articles published by the NYT in 1918/1919 vs. 2019/2020 in order to allow for contextual interpretation (also include 1917 for pre-pandemic reporting)
for(i in c(1917,2019,2020)){
  return_yearly_archive_data(i)
  print(paste("The dataframe with NYT articles from the year ", i," contains ", nrow(df), " rows and ", ncol(df), " columns.", sep = ""))
  assign(paste("NYT_archive_",i, "_df", sep = ""),df)
}
## # A tibble: 74,729 x 17
##    abstract web_url snippet lead_paragraph print_page source headline_proces…
##    <chr>    <chr>   <chr>   <chr>          <chr>      <chr>  <chr>           
##  1 "J. All… https:… "J. Al… ""             8          The N… WAR UNTIL VICTO…
##  2 "Athlet… https:… "Athle… ""             11         The N… GATHER FOR JUNI…
##  3 ""       https:… ""      ""             11         The N… ARRIVAL OF BUYE…
##  4 "Dec. t… https:… "Dec. … ""             14         The N… DECEMBER TRANSA…
##  5 "Conven… https:… "Conve… ""             3          The N… 45 IDIOT GIRLS …
##  6 ""       https:… ""      ""             11         The N… FOWL ENTERTAIN …
##  7 "Commen… https:… "Comme… ""             11         The N… Comment on Curr…
##  8 "Commit… https:… "Commi… ""             9          The N… To Hold Aqueduc…
##  9 "editor… https:… "edito… ""             8          The N… THE SOCIALIST'S…
## 10 "promot… https:… "promo… ""             4          The N… HAIG A FIELD MA…
## # … with 74,719 more rows, and 10 more variables: pub_date <chr>,
## #   document_type <chr>, news_desk <chr>, section_name <chr>,
## #   type_of_material <chr>, `_id` <chr>, word_count <int>, uri <chr>,
## #   print_section <chr>, text_combined <chr>
## [1] "The dataframe with NYT articles from the year 1917 contains 74729 rows and 17 columns."
## # A tibble: 53,258 x 18
##    abstract web_url snippet lead_paragraph source headline_proces… pub_date
##    <chr>    <chr>   <chr>   <chr>          <chr>  <chr>            <chr>   
##  1 "From t… https:… "From … "Throughout 2… The N… 1919: The Year … 2019-01…
##  2 "Imagin… https:… "Imagi… "More than th… The N… In Search of Lo… 2019-01…
##  3 "Wells … https:… "Wells… "Warren Wells… The N… Warren Wells, S… 2019-01…
##  4 "Can th… https:… "Can t… "In Willa Cat… The N… 2019: The Year … 2019-01…
##  5 "The Ch… https:… "The C… "The month be… The N… Why Trump Reign… 2019-01…
##  6 "The qu… https:… "The q… "FLORHAM PARK… The N… Wanted: A Jets … 2019-01…
##  7 "The Un… https:… "The U… "The unified … The N… Military Delete… 2019-01…
##  8 "There … https:… "There… "The N.F.L.’s… The N… Firing of Sever… 2019-01…
##  9 "Quotat… https:… "Quota… "“It’s a sham… The N… Quotation of th… 2019-01…
## 10 "The te… https:… "The t… "A Turkish te… The N… Video on Turkis… 2019-01…
## # … with 53,248 more rows, and 11 more variables: document_type <chr>,
## #   news_desk <chr>, section_name <chr>, type_of_material <chr>, `_id` <chr>,
## #   word_count <int>, uri <chr>, print_section <chr>, print_page <chr>,
## #   subsection_name <chr>, text_combined <chr>
## [1] "The dataframe with NYT articles from the year 2019 contains 53258 rows and 18 columns."
## # A tibble: 55,493 x 18
##    abstract web_url snippet lead_paragraph print_section print_page source
##    <chr>    <chr>   <chr>   <chr>          <chr>         <chr>      <chr> 
##  1 The gun… https:… The gu… "WHITE SETTLE… A             16         The N…
##  2 Congres… https:… Congre… "Congress inv… A             18         The N…
##  3 The tob… https:… The to… "The Trump ad… A             1          The N…
##  4 Christi… https:… Christ… "WEDNESDAY PU… <NA>          <NA>       The N…
##  5 Correct… https:… Correc… "An “On This … A             20         The N…
##  6 Quotati… https:… Quotat… "“He’s a Ramb… A             3          The N…
##  7 Attacks… https:… Attack… "Attacks on c… A             9          The N…
##  8 Weeks o… https:… Weeks … "HONG KONG — … A             6          The N…
##  9 A Los A… https:… A Los … "A lawsuit th… A             13         The N…
## 10 The pre… https:… The pr… "PALM BEACH, … A             15         The N…
## # … with 55,483 more rows, and 11 more variables: headline_processed <chr>,
## #   pub_date <chr>, document_type <chr>, news_desk <chr>, section_name <chr>,
## #   type_of_material <chr>, `_id` <chr>, word_count <int>, uri <chr>,
## #   subsection_name <chr>, text_combined <chr>
## [1] "The dataframe with NYT articles from the year 2020 contains 55493 rows and 18 columns."
# Count articles published every week in 2019 and 2020
NYT_2019_20_df <- rbind(NYT_archive_2019_df, NYT_archive_2020_df)
NYT_2019_20_df$pub_date <- as.numeric(strftime(NYT_2019_20_df$pub_date, format = "%V"))
base_present <- NYT_2019_20_df %>%
  group_by(pub_date) %>%
  summarise(articles_posted_per_week = n(), .groups = 'drop')

# Count articles published every week in 1918 and 1919
NYT_1918_19_df <- rbind(NYT_archive_1918_df, NYT_archive_1919_df)
NYT_1918_19_df$pub_date <- as.numeric(strftime(NYT_1918_19_df$pub_date, format = "%V"))
base_past <- NYT_1918_19_df %>%
  group_by(pub_date) %>%
  summarise(articles_posted_per_week = n(), .groups = 'drop')

# Visualise baseline comparison --> Finding: in 1918-1919, almost twice as many articles were published as today
baseline_plot <- ggplot() +
  geom_line(data = base_past, aes(x = pub_date, y = articles_posted_per_week), colour = "darkblue") +
  annotate(geom = "text", x = 46, y = 3150, label = "1918 - 1919", color = "darkblue", hjust = 0) +
  geom_line(data = base_present, aes(x = pub_date, y = articles_posted_per_week), colour = "darkgreen") +
  annotate(geom = "text", x = 46, y = 1350, label = "2019 - 2020", color = "darkgreen", hjust = 0) +
  theme_ipsum() + ylab("Articles Posted per Week") + xlab("Week of Year") +
  labs(title = "Comparison between articles published by NYT\nper week in 1918-1919 and 2019-2020 (aggregated)") +
  scale_x_continuous(breaks = c(1, 13, 26, 39, 52), labels = c("January", "March", "June", "September", "December"), limits = c(1, 52))

# Save baseline plot
ggsave("baseline_plot.png", baseline_plot, path = "img", width = 8, height = 6)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Now, onto the investigation of the coverage of the pandemic in 1918 and 1919
# Explore the combined text of the dataset NYT_1918_19_df (put any row number between 1 and 151,464 into the two lines below; the first line shows when the article was published, the second its text)
#NYT_1918_19_df$pub_date[1234]
#NYT_1918_19_df$text_combined[1234]

# Display article of interest
NYT_1918_19_df$text_combined[57506]
## [1] "DRASTIC STEPS TAKEN TO FIGHT INFLUENZA HERE; Health Board Issues 4 P.M. Closing Orders for All Stores Except Food and Drug Shops. HOURS FOR FACTORIES FIXED Plan, in Effect Today, to Reduce Crowding on Transportation Lines in Rush Periods. TIME TABLE FOR THEATRES Radical Regulations Necessary to Prevent Shutting City Up Tight, Says Dr. Copeland. 1,695 New Cases Here. DRASTIC STEPS TO FIGHT INFLUENZA Has Power to Close Up City. Theatre Opening Hours. Conditions Different Here. To Permit Fires Earlier. Health Bd issues opening and closing time orders for industries and meeting places to reduce crowding on transportation lines"
# Save screenshot of such article
NYT_1918_19_df$web_url[57506] # --> copy-paste to browser
## [1] "https://www.nytimes.com/1918/10/05/archives/drastic-steps-taken-to-fight-influenza-here-health-board-issues-4.html"
## Visualise mentions of the pandemic in 1917 (pre-pandemic) and 1918/1919 compared to 2019 (pre-pandemic) and 2020
# Clean NYT_archive_1917_df (it contains a few articles published in 1851), combine the dfs and transform pub_date into date format
NYT_archive_1917_df_clean <- NYT_archive_1917_df %>%
  filter(pub_date >= as.Date("1900-01-01"))
NYT_1917_18_19_df <- rbind(NYT_archive_1917_df_clean, NYT_archive_1918_df, NYT_archive_1919_df)
NYT_1917_18_19_df$pub_date <- as.Date(NYT_1917_18_19_df$pub_date)
NYT_2019_20_df <- rbind(NYT_archive_2019_df, NYT_archive_2020_df)
NYT_2019_20_df$pub_date <- as.Date(NYT_2019_20_df$pub_date)

# Create corpus, remove punctuation and stopwords, create dfm, retain only words appearing twice or more
NYT_19_corpus <- corpus(NYT_1917_18_19_df, docid_field = "_id", text_field = "text_combined")
NYT_19_dfm <- NYT_19_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 2)

# Create dictionary with terms related to influenza (note: the multi-word pattern
# "sweating sickness" only matches if such bigrams are first compounded into
# single tokens, e.g. with tokens_compound())
influenza_dict <- dictionary(list(influenza = c("*influenza*", "*infect*", "*pandemic*", "*virus*", "*disease*", "*spread*", "*epidemic*", "sweating sickness")))

# Create dfm with respect to dictionary
NYT_dfm_19_influenza <- dfm_lookup(NYT_19_dfm, dictionary = influenza_dict, valuetype = "glob")

# Convert newly created dfm to df, group by pub_date and summarise the number of times influenza was mentioned
df_19_plot <- convert(NYT_dfm_19_influenza, to = "data.frame") %>%
  cbind(docvars(NYT_dfm_19_influenza)) %>%
  group_by(pub_date) %>%
  summarise(influenza = sum(influenza), .groups = 'drop')

# Mutate a new column containing a rolling average over 7 days (1 week) to smooth the plot (rollmean() is from the zoo package)
df_19_plot <- df_19_plot %>%
  mutate(influenza_rolling_avg= rollmean(influenza, 7, align="left", fill=0))

# Create plot of x=pub_date and y=influenza_rolling_avg
influenza_19_plot <- ggplot(data = df_19_plot) +
  geom_line(aes(x = pub_date, y = influenza_rolling_avg)) +
  labs(title = "Mentions of terms related to the Influenza pandemic\nin the New York Times in 1917, 1918 & 1919") +
  xlab("Time") + ylab("Influenza mentions per Week") +
  theme_ipsum()

#Compare with 2019-2020 (i.e. repeat process from above for 2019-2020)
NYT_20_corpus <- corpus(NYT_2019_20_df, docid_field = "_id", text_field = "text_combined")

NYT_20_dfm <- NYT_20_corpus %>% tokens(remove_punct = TRUE) %>% tokens_remove(pattern = stopwords("en")) %>%
  dfm() %>% dfm_trim(min_termfreq = 2)
NYT_dfm_20_influenza <- dfm_lookup(NYT_20_dfm, dictionary = influenza_dict, valuetype = "glob")

df_20_plot <- convert(NYT_dfm_20_influenza, to = "data.frame") %>% cbind(docvars(NYT_dfm_20_influenza)) %>%
  group_by(pub_date) %>% summarise(influenza = sum(influenza), .groups = 'drop')
df_20_plot <- df_20_plot %>% mutate(influenza_rolling_avg= rollmean(influenza, 7, align="left", fill=0))

influenza_20_plot <- ggplot(data = df_20_plot) +
  geom_line(aes(x = pub_date, y = influenza_rolling_avg)) +
  labs(title = "Mentions of terms related to the Influenza pandemic\nin the New York Times in 2019 & 2020") +
  xlab("Time") + ylab("Influenza mentions per Week") +
  theme_ipsum()

#Save comparison plots
ggsave("influenza_19_plot.png", influenza_19_plot, path = "img", width = 8, height = 6)
ggsave("influenza_20_plot.png", influenza_20_plot, path = "img", width = 8, height = 6)

### Build a wordcloud
# Extract rows in which combined_text contains any of the words contained in vector pattern
pattern <- "influenza|infect|pandemic|virus|disease|spread|epidemic|sweating sickness"
articles_influenza <- filter(NYT_1918_19_df, grepl(pattern, text_combined, ignore.case = TRUE))

# Wordcloud of articles mentioning terms related to the influenza pandemic
influenza_corpus <- corpus(articles_influenza, docid_field = "_id", text_field = "text_combined")
influenza_keywords <- c("influenza", "infect", "pandemic", "virus", "disease", "spread*", "epidemic", "sweating sickness")

# Create dfm; remove punctuation, numbers, symbols, stopwords and influenza_keywords
influenza_dfm <- influenza_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("en"), case_insensitive = TRUE) %>%
  tokens_remove(pattern = influenza_keywords, case_insensitive = TRUE) %>%
  dfm()

# Create Wordcloud and save it (source: https://www.datamentor.io/r-programming/saving-plot/)
png(file="img/wordcloud_influenza_NYT.png", width=800, height=400)
textplot_wordcloud(influenza_dfm, rotation=0, min_size=1, max_size=4, max_words=50, color = "darkblue")
dev.off()
## quartz_off_screen 
##                 2

Historical Context

The influenza pandemic known as the Spanish Flu took place close to the end of the First World War, in 1918 and 1919. Today, estimates put infections during this pandemic at up to 500 million people and deaths at around 20-50 million (Johnson & Mueller, 2002). For the following analysis, historical context is critical to understanding the findings. As I intend to analyse this pandemic by means of New York Times articles, it is important to mention that a) the war was the predominant topic of conversation and dominated the headlines throughout the pandemic, and b) political motives to retain morale (to keep fighting the war) led, in the beginning of the pandemic, to a delay in reporting on the effects of the pandemic and the extent to which it spread across the globe (this is also the reason it is known as the Spanish Flu: Spain, neutral in the First World War, did not censor news reports of the novel disease and was hence the first to report on it) (Trilla, Trilla & Daer, 2008). In other words, the circumstances in 1918 and 1919 were severely different from those of today. The New York Times itself underlined this argument on the 5th of January 2021, publishing an article with the following subtitle: “The influenza outbreak killed more than 20,000 New Yorkers and 675,000 Americans. It might have dominated the news, if not for World War I.” (New York Times, 2021). Another argument they propose for the lack of coverage is that “the science of virology did not even exist, so nobody knew what caused influenza” (ibid.).

Furthermore, the difference in the number of articles published in 1918 and 1919 compared to 2019 and 2020 has to be taken into account when analysing NYT articles about the Spanish Flu in relation to Covid-19. As the graph below shows, significantly more articles were published in 1918 and 1919 (151,464 articles) than in 2019 and 2020 (108,751 articles). This might partly be attributed to the war going on in 1918 and 1919, which produced a large number of short notices about battles, war figures and other war-related news. Nonetheless, an investigation of differences in the word counts of articles from all years would be necessary to confirm this, which is impossible due to a lack of available data.



Analysis

Corresponding with historic reports, the figure below depicts how coverage of the pandemic only started to pick up during the fall (autumn) of 1918, in the second and most severe wave of the pandemic (CDC, 2018). Nonetheless, surprisingly, at the beginning of 1919, when a third wave of the pandemic still killed thousands across the US and likely many more throughout the world, the density of reports mentioning influenza-related terms falls back close to pre-pandemic levels.



This differs from the explosion of mentions during the Covid-19 pandemic, where just briefly after the first report of a novel coronavirus in Wuhan, the number of times terms related to an imminent pandemic (“influenza”, “infect”, “pandemic”, “virus”, “disease”, “spread”, “epidemic”, “sweating sickness”, excluding any mentions of “covid”) were mentioned in articles of the New York Times rose sharply (WHO, 2020).



While the causes of this remain to be investigated, one reason might be a more developed global information infrastructure, which quickly facilitated an understanding of the severity of this pandemic (historical knowledge of the impact of the Spanish Flu might also have played a part in motivating quick and significant action). The New York Times reported cases for all countries during the Covid pandemic, whereas the reporting of cases in 1918-1919 was predominantly limited to New York (see screenshot below).



An investigation of the language predominantly used in article headlines and snippets (see the wordcloud below) confirms that, despite mentioning the pandemic, news reports were still heavily biased towards terms such as “army”, “war” and “german”. To sum up, although the Spanish Flu killed millions more than the First World War (even if reports diverge about the numbers), the news reports of the New York Times failed to acknowledge its severity. However, comparing this lack of reporting to the recent Covid-19 pandemic, it appears that the New York Times has learned from the impact of the Spanish Flu.



References (Task 8)

  1. CDC. (2018). 1918 Pandemic Influenza Historic Timeline. [online]. Available from: https://www.cdc.gov/flu/pandemic-resources/1918-commemoration/pandemic-timeline-1918.htm (Accessed 05.01.2021)

  2. Johnson, N.P. and Mueller, J. (2002). Updating the accounts: global mortality of the 1918-1920 “Spanish” influenza pandemic. Bulletin of the History of Medicine, pp.105-115.

  3. New York Times. (2021). Echoes of Another Pandemic: How The Times Covered the 1918 Flu. [online]. Available from: https://www.nytimes.com/2021/01/05/insider/1918-pandemic.html (Accessed 05.01.2021)

  4. Trilla, A., Trilla, G. and Daer, C. (2008). The 1918 “Spanish flu” in Spain. Clinical Infectious Diseases, 47(5), pp.668-673.

  5. WHO. (2020). Archived: WHO Timeline - COVID-19. [online]. Available from: https://www.who.int/news/item/27-04-2020-who-timeline---covid-19 (Accessed 05.01.2021)